The ABC recently publish an interesting piece of data-journalism exploring “what makes Fremantle Dockers captain Nat Fyfe so good?” The authors put forth a compelling, data-driven case for the specific attributes that make Nathan Fyfe stand out from other players

Below is an eye-catching visualization from the article:

Alt text The scatterplot above depicts contested possessions (x-axis) by contested marks (y-axis). Each blue data-point is a player’s statistical average on these metrics for a single season

It shows between 2014 and 2019, Nat Fyfe was something of a statistical anomaly - exhibiting a rare combination of elite contested ball winning ability and aerial dominance. No other player in the competition was able to combine these two traits to such a level

To this end, the authors ordain Nat Fyfe as the The Flying Bulldozer

This got me thinking what other players demonstrate anomalously good skill-profiles for specific combinations of metrics?

A flexible way to answer this question is via an interactive shiny application that loosely replicates the above scatterplot, but also allows end-users to input match-day metrics of their choosing

This walkthrough will cover how to do this, summarizing:

Let’s dive in!

The Data

As per my previous post, I have utilized the “FitzRoy” package to extract data from AFLTables & Footywire

Referencing this useful link by the Fitzory package author, I was able to find a faster, more data-rich method to source this data

In my previous post, extracting data was relatively time-consuming, & required imputation of match_ids corresponding to the range of games

The below code snippet only requires the year(s) of interest to be imputed

library(dplyr)
library(tidyr)
library(ggplot2)
library(plotly)
library(fitzRoy)
library(shiny)
library(shinyWidgets)
library(ggthemes)
library(tableHTML)
library(kableExtra)
library(knitr)
afl.1 <- get_fryzigg_stats(start = 2017, 
                           end = 2020)

The structure of this extract includes all AFL premiership season matches from 2017 to the present. Each individual match is associated with 44 rows, one for each player who participated in the game. Over four seasons of football, this accumulates to 33,264 rows of data

Re-shape the data to player averages

Two key steps required to transform this data are:

To make things more manageable, let’s filter to only the fields of interest

afl.2 <- afl.1 %>% 
  select(# player details
         player_id,
         player_first_name,
         player_last_name,
         # player statistics 
         kicks,
         handballs,
         disposals,
         disposal_efficiency_percentage,
         goals,
         behinds,
         hitouts,
         tackles,
         rebounds,
         inside_fifties,
         clearances,
         clangers,
         free_kicks_for,
         free_kicks_against,
         contested_possessions,
         uncontested_possessions,
         brownlow_votes,
         contested_marks,
         marks_inside_fifty,
         one_percenters,
         bounces,
         goal_assists,
         time_on_ground_percentage,
         centre_clearances,
         stoppage_clearances,
         score_involvements,
         metres_gained,
         turnovers,
         intercepts,
         tackles_inside_fifty,
         contest_def_losses,
         contest_def_one_on_ones,
         contest_off_one_on_ones,
         contest_off_wins,
         def_half_pressure_acts,
         effective_kicks,
         f50_ground_ball_gets,
         ground_ball_gets,
         hitouts_to_advantage,
         intercept_marks,
         pressure_acts,
         ruck_contests,
         score_launches,
         spoils
         )

& calculate the average match-day statistics for each player

This is achieved by grouping the data by player, then using summarise_at to calculate the arithmetic average (i.e. mean) for all 43 match-day statistics

# Aggregate
player <- afl.2 %>% 
  group_by(player_id,
           player_first_name,
           player_last_name) %>%
  summarise_at(.vars = colnames(.)[4:47], mean) # calculate mean for all stats 
# Ensure table is dataframe
player <- as.data.frame(player)
# round numeric columns to 2 decimal places 
player <- player %>% mutate_if(is.numeric, ~round(., 2))

Which produces the below table:

player_id player_first_name player_last_name kicks handballs disposals disposal_efficiency_percentage goals behinds hitouts tackles rebounds inside_fifties clearances clangers free_kicks_for free_kicks_against contested_possessions uncontested_possessions brownlow_votes contested_marks marks_inside_fifty one_percenters bounces goal_assists time_on_ground_percentage centre_clearances stoppage_clearances score_involvements metres_gained turnovers intercepts tackles_inside_fifty contest_def_losses contest_def_one_on_ones contest_off_one_on_ones contest_off_wins def_half_pressure_acts effective_kicks f50_ground_ball_gets ground_ball_gets hitouts_to_advantage intercept_marks pressure_acts ruck_contests score_launches spoils
11084 Robert Murphy 12.24 7.53 19.76 77.00 0.59 0.41 0.00 2.12 1.47 4.06 0.71 2.18 0.59 0.71 4.41 15.24 0.00 0.24 0.53 0.94 0.82 0.71 81.53 0.12 0.53 5.29 378.53 4.18 2.59 0.29 0.24 0.53 0.76 0.47 4.18 9.12 0.35 3.53 0.00 0.41 8.71 0.00 0.94 0.65
11095 Drew Petrie 8.25 3.81 12.06 67.56 1.00 0.19 12.19 3.31 0.81 1.75 1.75 2.44 1.50 1.38 7.00 6.00 0.00 1.31 1.00 4.06 0.00 0.25 81.94 0.69 1.06 3.75 196.12 2.62 1.94 0.81 0.19 0.62 1.75 0.56 5.31 5.06 0.38 3.31 3.88 0.62 12.94 34.88 1.06 3.38
11139 Scott Thompson 5.00 5.00 10.00 80.00 0.00 0.00 0.00 5.00 1.00 4.00 1.00 5.00 0.00 5.00 5.00 6.00 0.00 0.00 0.00 1.00 0.00 1.00 72.00 1.00 0.00 3.00 220.00 4.00 4.00 0.00 0.00 0.00 1.00 0.00 14.00 4.00 1.00 4.00 0.00 1.00 24.00 0.00 1.00 0.00
11153 Nick Riewoldt 10.94 5.88 16.82 76.00 1.71 0.71 0.06 2.24 0.47 2.29 0.82 1.71 1.18 0.35 5.29 11.82 0.24 1.24 2.18 1.59 0.06 0.53 89.76 0.00 0.82 6.47 203.35 2.82 1.18 1.24 0.12 0.24 2.76 0.82 2.53 7.53 1.12 2.65 0.00 0.29 10.53 0.12 0.41 0.94
11170 Gary Ablett 14.34 10.86 25.20 68.66 1.00 0.75 0.00 4.28 1.66 4.94 3.94 4.12 0.91 1.58 10.19 15.02 0.64 0.23 0.58 0.69 0.36 0.73 85.89 1.50 2.42 7.11 417.39 5.08 2.23 0.77 0.08 0.14 0.77 0.28 5.88 8.80 1.84 7.81 0.00 0.12 15.66 0.02 1.39 0.20
11192 Shaun Burgoyne 9.18 8.07 17.25 76.87 0.60 0.41 0.00 3.69 1.53 2.79 2.37 2.15 0.78 1.01 7.44 9.51 0.07 0.32 0.51 2.24 0.25 0.59 77.19 0.90 1.47 4.91 239.76 2.62 3.07 0.59 0.19 0.69 0.79 0.22 6.78 6.29 0.99 4.93 0.00 0.46 15.21 0.00 1.18 1.06

Additional Fields

Number of games

A consideration for this cohort is the presence of outliers that are caused by players who have only taken part in a small number of games. If a player had 3 exceptional games before getting injured/retiring, their average performance may not be especially representative of what would be expected over a longer period of time

Of course, it’s nice to keep these players in the dataset. By deriving a variable of the # of games played, users can set this parameter to include/exclude players as they see appropriate. What constitutes a representative sample of games per player is ultimately subjective, but I have set the default for the app at 22 games - equivalent to one season of football, excluding finals

Calculating the number of games requires aggregation of player_id then a join back to the player-statistics table

# Extract out number of games & join back 
player <- afl.2 %>%
  group_by(player_id) %>% 
  summarise(Games = dplyr::n()) %>% 
  left_join(player, by = "player_id")

Player Team

The team of the player is included in the original dataset that has individual match-stats

However players routinely change teams as part of trades & free-agency. This complicates things as some players will have played in multiple teams. As such, we are specifically interested in a players most recent team

This can be elucidated by ordering the original dataset from most recent to least recent game, then filtering to keep only a distinct record of the player_id & player_team from their most recent match

# extract most recent team 
player <- afl.1 %>% 
  arrange(desc(match_id)) %>% # to get most recent team 
  select(player_id, player_team) %>% 
  distinct(player_id, .keep_all = TRUE) %>% # only distinct records 
  left_join(player, by = "player_id") 

Player Position

Player position is a useful filter to narrow the data to only players for which a statistic is relevant. For example, a defender would be expected to perform well on spoils, but not score involvements. As such, the option to limit the data to defenders would be welcome

By extracting out the unique elements of the player_position field …

position <- afl.1 %>% 
  select(player_id, 
         player_position)

unique_positions <- unique(position$player_position)
unique_positions
##  [1] "INT"  "FF"   "FPL"  "R"    "BPR"  "C"    "WL"   "FPR"  "RK"   "HFFR"
## [11] "CHB"  "FB"   "HFFL" "BPL"  "CHF"  "HBFL" "WR"   "RR"   "HBFR"

We observe 19 distinct positions, corresponding to the specific position of the selected team for each match … i.e. left forward pocket & right half back flank. For the present analysis, this level of positional-granularity isn’t required. We’re more interested in broadly distinguishing …

  • Forwards
  • Midfielders
  • Defenders
  • Rucks

Let’s bucket the positions into these broader categories

# Bucketed positions 
# Midfielders
position$PositionType[
  position$player_position ==  "C" | 
    position$player_position  == "RR" | 
    position$player_position  ==  "R" | 
    position$player_position  ==  "WL" | 
    position$player_position  ==  "WR"] <- 
  "Midfield"

#Defenders
position$PositionType[
  position$player_position ==  "BPL" | 
    position$player_position  == "FB" | 
    position$player_position  ==  "BPR" | 
    position$player_position  ==  "HBFL" | 
    position$player_position  ==  "HBFR" | 
    position$player_position  ==  "CHB"] <- 
  "Defender"

#Forwards
position$PositionType[
  position$player_position ==  "FPR" | 
    position$player_position  == "FF" | 
    position$player_position  ==  "HFFL" | 
    position$player_position  ==  "FPL" | 
    position$player_position  ==  "HFFR" | 
    position$player_position  ==  "CHF"] <- 
  "Forward"

#Rucks
position$PositionType[
  position$player_position ==  "RK"] <- 
  "Ruck"

# Interchange 
position$PositionType[
  position$player_position ==  "INT"] <- 
  "Interchange"

Creating a new variable Position Type

Player position in this data is captured on a per game basis, meaning some players will accrue multiple positions. For example, Dustin Martin is named as a forward in some games, and on the ball in others

One way to navigate this is to count all the positions each player has been named in, then filter to only the position with the highest count of games for each player (a proxy for their most common position)

# count how often players named in a specific position
position2 <- position %>% 
  group_by(player_id, 
           PositionType) %>% 
  summarise(Position_Count = dplyr::n()) # count of positions per player

# Order by count of positions, for each player 
position2 <- position2[with(position2, 
                            order(player_id, 
                                  Position_Count)), ]

# keep only the 'position' with the max count
position2 <- position2 %>% 
  group_by(player_id) %>% 
  top_n(1, Position_Count)

# For rare cases where two positions have an equal game count, keep only one position
position2 <- position2 %>% distinct(player_id, .keep_all = TRUE)

# remove position count 
position2$Position_Count <- NULL

# Join back to dataframe 
player <- left_join(player, position2, by = "player_id")

This is satisfactory in that every player has now been ascribed a unique position

However, one further issue is that players who started on the bench more than on the ground have been designated interchange - strictly speaking, not a position.

I explored removing interchange entirely before filtering. However, of the ~ 950 total players, 53 had only been named on the interchange. By removing interchange, some players were left without a position

This gap could be remedied be googling, then manually imputing the position of these players. But for all intents & purposes (ergo laziness of the author), a small number of player-positions designated as interchange will not dramatically impact the visualization

As such, our metric is strictly speaking, most commonly named starting position type

Final Tidying

To finish up, I have cleaned up the dataframe (re-order variables, changed column types removed redundant columns)

# remove ID
player$player_id <- NULL

# combine first and last name
player <- player %>% 
  unite(Player_Name,
        c(player_first_name, player_last_name),
        remove = TRUE)

# remove underscore
player$Player_Name <- gsub("_", " ", player$Player_Name, fixed = TRUE)

# Re-order Columns
player <- player[,c(3, 1, 2, 48, 4:47)]

# team as factor
player$player_team <- as.factor(player$player_team)
# position as factor
player$PositionType <- as.factor(player$PositionType)

# games as numeric
player$Games <- as.numeric(player$Games)

Resulting in the below final dataset:

Player_Name player_team Games PositionType kicks handballs disposals disposal_efficiency_percentage goals behinds hitouts tackles rebounds inside_fifties clearances clangers free_kicks_for free_kicks_against contested_possessions uncontested_possessions brownlow_votes contested_marks marks_inside_fifty one_percenters bounces goal_assists time_on_ground_percentage centre_clearances stoppage_clearances score_involvements metres_gained turnovers intercepts tackles_inside_fifty contest_def_losses contest_def_one_on_ones contest_off_one_on_ones contest_off_wins def_half_pressure_acts effective_kicks f50_ground_ball_gets ground_ball_gets hitouts_to_advantage intercept_marks pressure_acts ruck_contests score_launches spoils
Jarrod Witts Gold Coast 77 Ruck 7.06 6.13 13.19 64.44 0.09 0.21 38.47 3.10 1.16 1.97 4.04 2.47 1.61 1.03 7.96 5.21 0.05 0.45 0.25 2.84 0.03 0.17 84.64 1.30 2.75 3.49 167.13 2.40 1.91 0.48 0.17 0.68 0.84 0.16 5.56 3.52 0.55 4.69 10.52 0.58 10.91 75.06 2.56 2.25
Lachie Weller Gold Coast 76 Midfield 11.62 8.08 19.70 72.45 0.30 0.26 0.00 2.59 3.38 3.03 1.83 2.54 0.70 0.61 5.70 13.05 0.04 0.11 0.07 1.39 1.00 0.22 83.71 0.42 1.42 3.37 393.43 4.36 3.55 0.28 0.20 0.66 0.08 0.00 7.96 7.58 0.47 4.43 0.00 0.49 14.00 0.00 0.92 1.01
Ben Ainsworth Gold Coast 56 Forward 8.32 5.95 14.27 65.95 0.71 0.68 0.00 2.46 0.57 2.77 1.21 2.43 0.84 0.66 5.55 8.84 0.00 0.29 0.68 0.73 0.21 0.45 75.27 0.32 0.89 4.07 231.70 3.14 1.57 0.79 0.04 0.05 0.91 0.16 5.73 4.84 1.07 4.20 0.00 0.11 15.64 0.00 0.57 0.32
Noah Anderson Gold Coast 15 Interchange 8.73 7.00 15.73 61.73 0.27 0.27 0.00 1.87 0.93 2.53 2.00 2.40 0.60 0.53 6.67 9.13 0.00 0.13 0.33 1.07 0.20 0.27 75.80 0.40 1.60 3.27 248.40 3.87 2.60 0.20 0.07 0.13 0.27 0.07 6.47 4.20 0.27 5.53 0.00 0.20 12.20 0.00 0.60 0.73
Mitch Robinson Brisbane Lions 64 Midfield 11.86 7.56 19.42 69.73 0.62 0.36 0.00 4.09 1.81 3.92 2.78 3.03 1.23 1.16 7.75 11.64 0.09 0.48 0.47 1.84 0.08 0.47 79.77 0.84 1.94 4.62 362.67 4.09 3.30 0.69 0.09 0.28 0.41 0.20 6.59 7.67 0.69 5.20 0.00 0.73 14.81 0.00 1.12 1.14
Daniel Rich Brisbane Lions 75 Defender 14.99 5.47 20.45 77.93 0.19 0.27 0.00 1.83 5.16 2.85 1.13 2.45 0.76 0.79 5.43 12.76 0.05 0.27 0.09 2.33 0.04 0.31 82.36 0.27 0.87 3.85 446.96 3.92 5.40 0.05 0.25 0.95 0.01 0.01 6.43 11.49 0.11 4.09 0.00 1.28 9.61 0.00 1.45 1.81

Deploying the Shiny App

With this data, we can build out an application to create a dynamic, interactive scatterplot

For those not au-fait with R, I have hosted the app on Shiny Apps. As I am using a free-account, this link is subject to bandwidth limits, & may not work!

Alternatively, the full code to deploy this app locally can be found at my GitHub page

Clicking the app-link, or running this script will produce something like the below:

Alt text In the left hand panel, there are several parameters that can be changed. Users can select any of the included statistics to compare on the scatterplot

Players can be further filtered by their team, position, or the number of games they have played between 2017 and 2020

Visualisation & Exploration of the data

Although the above Shiny app is the definitive way to explore this data, it can just as easily be replicated manually with ggplot2 & plotly

In the below section, I will showcase some player/metric combinations I found to be interesting

To begin, let’s create a generic ggplot theme for visualization

Plot_Theme2 <-
  theme_minimal() + # start with a minimal theme and add what we need
  theme(text = element_text(color = "gray20"),
        legend.position = "top", # position the legend in the upper left 
        legend.direction = "horizontal",
        legend.justification = 0.1, # anchor point for legend.position.
        legend.title = element_text(size = 10, color = "gray20"),
        legend.text = element_text(size = 10, color = "gray20"),
        axis.text = element_text(face = "italic"),
        axis.title.x = element_text(vjust = -1, size = 10), # move title away axis
        axis.title.y = element_text(vjust = 2, size = 10), # move away for axis
        axis.ticks.y = element_blank(), # remove elements
        axis.line = element_line(color = "gray20", size = 0.5),
        axis.line.y = element_blank(),
        panel.grid.major = element_line(color = "gray80", size = 0.5),
        panel.grid.minor = element_line(color = "gray80", size = 0.5),
        panel.grid.major.x = element_blank(),
        plot.title = element_text(size = 11)
  )

& create a number of custom text-overlays for specific players, so their data-point can be directly identified on the plot

Buddy <- player %>%
  filter(Player_Name %in% c("Lance Franklin"))

McGovern <- player %>%
  filter(Player_Name %in% c("Jeremy McGovern"))

SaadMcKenna <- player %>%
  filter(Player_Name %in% c("Adam Saad",
                            "Conor McKenna"))

In the spirit of The Flying Bulldozer - let’s introduce three further sobriquets

Jeremy McGovern - “The Roaming Backboard”

Jeremy McGovern is considered by many to be the premium spare defender in the competition, with a remarkable knack for reading the ball in the air. The stats emphatically support this

p1 <- player %>% 
  filter(Games >= 22) %>% 
  ggplot(aes(label = Player_Name,
             label2 = Games,
             color = PositionType))
p1 <- p1 + geom_point(aes(intercept_marks, 
                        contested_marks), 
                    shape = 16, 
                    size = 3, 
                    alpha = 4/10) + 
  geom_text(data = McGovern, 
            size = 2, 
            colour = "#730202",
            aes(intercept_marks, contested_marks)) +
  scale_colour_manual(values=c("#b80909", 
                               "#0d6cb5", 
                               "#e0842d", 
                               "#31ad28", 
                               "#e3c609")) +
  Plot_Theme2
p1 <- ggplotly(p1)
p1

Not only is Jeremy McGovern in the top handful of players for average contested marks per game (& far & away the best defender on this metric), but he also averages more intercept marks than any player in the competition

This makes Jeremy McGovern a clear outlier.When balls are launched into defensive 50, they are more often than not going to be repelled by the best defensive, contested, aerialist in the business - the roaming backboard

Buddy Franklin - “The Scoring Beast”

Buddy, & his accolades as a player need no introduction. No-one would argue there is a better current-day forward in the competition.

It’s stark just how superior Buddy’s scoring power is to other players

p2 <- player %>% 
  filter(Games >= 22) %>% 
  ggplot(aes(label = Player_Name,
             label2 = Games,
             color = PositionType))
p2 <- p2 + geom_point(aes(goals, 
                        score_involvements), 
                    shape = 16, 
                    size = 3, 
                    alpha = 4/10) + 
  geom_text(data = Buddy, 
            size = 2, 
            colour = "#07406b",
            aes(goals, score_involvements)) +
  scale_colour_manual(values=c("#b80909", 
                               "#0d6cb5", 
                               "#e0842d", 
                               "#31ad28", 
                               "#e3c609")) +
  Plot_Theme2
p2 <- ggplotly(p2)
p2

Buddy not only kicks more goals than any other player, he is also directly involved in more scoring chains, helping others around him to hit the scoreboard. It may be stating the painfully obvious, but Buddy is a scoring beast

Hopefully we will see an injury free Lance Franklin in 2021

Adam Saad & Conor McKenna - “The Bouncing-Bomber-Brothers”

There are not many more exciting sights in the AFL than a player streaming down the field at pace, thumping the ball into the turf

Over the past four seasons, only 5 players have averaged more than 1.5 running bounces per game, & of these 5, there are two clear outliers. The gentlemen in question are Conor McKenna & Adam Saad - who interestingly are both running defenders for Essendon. For this reason they are ordained the bouncing bomber brothers

p4 <- player %>% 
  filter(Games >= 22) %>% 
  ggplot(aes(label = Player_Name,
             label2 = Games,
             color = PositionType))
p4 <- p4 + geom_point(aes(bounces, 
                          metres_gained), 
                      shape = 16, 
                      size = 3, 
                      alpha = 4/10) + 
  geom_text(data = SaadMcKenna, 
            size = 2, 
            colour = "#730202",
            aes(bounces, metres_gained)) +
  scale_colour_manual(values=c("#b80909", "#0d6cb5", "#e0842d", "#31ad28", "#e3c609")) +
  Plot_Theme2
p4 <- ggplotly(p4)
p4

The ‘brothers’ also average ~ 350m gained per match, so its fair to say their fleetness of foot has been essential to Essendon’s territory game

Sadly Conor McKenna has announcement his retirement & return to Ireland. While speculation is mounting Adam Saad may request a trade away from Essendon. The brotherhood is due to end just as soon as it has been anointed

For anyone reading, I will be interested in any other insights to come from this app!